\section{Previous work}

Most work on generic dispatch has been done in the context of more
mainstream programming languages such as \cplusplus{} or \java{}.
There are three aspects of \cl{} that complicate the situation with
respect to such languages:

\begin{enumerate}
\item Multiple inheritance.
\item Multiple dispatch.
\item Interactivity.
\end{enumerate}

Multiple inheritance%
\footnote{\cplusplus{} also has multiple inheritance of course.}
makes it possible for a slot to have a different \emph{position} in
the slot vector in instances of different classes.  Slot accessors
must take this possibility into account.

Multiple dispatch makes it more difficult to use table-based
techniques with entries for each class, because the size of the table
grows exponentially in the number of specialized parameters.
Table-compression techniques help overcome some of these problems, at
the cost of a more complicated, and thus more expensive, dispatch
algorithm. 

Many existing techniques are based on the complete program being
available in order to compute dispatch tables.  The interactive nature
of \cl{} makes it more difficult to use such techniques because every
table may have to be recomputed whenever a small modification is made
to a class.  However, some techniques may amortize this cost by
\emph{invalidating} some tables and recomputing them when required.

The fact that \cl{} is interactive also makes possible the
existence of \emph{obsolete instances}, i.e., instances where the
slots no longer correspond to the definition of the class, so the
instances have to be \emph{updated} before they are inspected.  

\subsection{PCL}

In PCL\footnote{PCL stands for Portable Common Loops.}
\cite{Kiczales:1990:EMD:91556.91600} a standard object is represented
as a two-word header object where the first word is a pointer to a
\emph{class wrapper} and the second word is a pointer to the
\emph{slot vector} of the instance.  The class wrapper is also a
two-word structure that contains a \emph{hash seed} and a pointer to
the class object. 

Each generic function contains a \emph{memoization table}.  Each entry
of the table contains a class wrapper and the entry point for the
effective method to be called for instances of the corresponding
class.  The memoization table uses a simple hashing mechanism, so that
the hash seed of the class wrapper of the argument is reduced modulo
the size of the memoization table in order to find the corresponding
entry.  The class wrapper in the entry is compared using \texttt{eq}
to the class wrapper of the argument, and if they are the same, the
corresponding effective method is called.  When there is no hit, the
reason could be that there is a hash collision, or it could be that no
entry exists in the table for the class of the argument.  Thus, if
there is no initial hit, the table is searched sequentially until an
entry is found or all the entries have been examined.

In the best case then, the following operations are required for a
simple slot reader generic function:

\begin{enumerate}
\item Access the class wrapper of the argument; a memory access.
\item Access the hash seed of the class wrapper; a memory access.
\item Access the size of the memoization table; a memory access.
\item Reduce the hash seed of the class wrapper modulo the size of the
  memoization table; a simple masking operation if the size of the
  table is a power of $2$.
\item Access the memoization table of the generic function; a memory
  access.
\item Access the class wrapper in the memoization table entry; a
  memory access.
\item Compare the class wrapper in the memoization table entry to the
  class wrapper of the argument; a simple register comparison.
\item Access the entry point of the effective method in the
  memoization table entry; a memory access.
\item Jump to the entry point of the effective method; an
  unconditional jump.
\item The effective method accesses the slot vector of the instance; a
  memory access.
\item The slot containing the desired object is read and returned; a
  memory access.
\end{enumerate}

As we can see, there are $8$ memory accesses involved. 

The authors also mention an optimization for slot readers and slot
writers in the case where such a generic function is called with only
a few different classes.  In that case, they suggest replacing the
table lookup with a simple test for the different cases.  However,
since class wrappers are heap-allocated objects, a copying garbage
collector may move them around.  For that reason, class wrappers can
not be inline constants in the code, and must be stored in the generic
function.  If such an optimization is implemented, the mechanism is
reduced to the following steps:

\begin{enumerate}
\item Access the class wrapper of the argument; a memory access.
\item Access one or more class wrappers stored in the generic
  function; at least one memory access.
\item Compare the class wrapper of the argument to the class
  wrapper(s) stored in the generic function; fast register operation.
\item Access the slot vector of the instance; a memory access.
\item The slot containing the desired object is read and returned; a
  memory access.
\end{enumerate}

The minimum number of memory accesses is reduced to $4$ if the generic
function is called with instances of a single class.  

The technique used by PCL automatically detects obsolete instances.
When a class is modified, the current wrapper is eliminated from all
existing memoization tables, forcing the lookup to fail, and thus
triggering the update of the obsolete instance.  

While not mentioned in the published work, PCL also allows for a
discriminating function that tests argument types using
\texttt{typep}.  While in general quite expensive, \texttt{typep} can
be efficient when a built-in type known at compile time is tested for.
If the number of cases is small and the \texttt{typep} test can be
determined to be inexpensive, then this kind of discriminating
function could potentially achieve performance similar to our
technique. 

\subsection{Work by Zendra et al}

Perhaps the work that is most similar to ours is that of Zendra et al
\cite{Zendra:1997:EDD:263698.263728}.  Like the present work, they are
interested in performance improvements to dispatch on modern
architectures by eliminating table lookups.  Unlike the present work,
the context of their work is a static language (Eiffel) which
simplifies many aspect of the optimization.  For one thing, they use
global type inference to optimize away dispatch entirely when not
needed, and they can inline the dispatch mechanism at each call site
because there can be no changes to the class hierarchy or to the
applicable methods at runtime.  Like the present work, they also use
inline code performing a binary search in order to determine which
applicable method to invoke.

\subsection{Other work}

Dreisen et al \cite{Driesen:1995:MDP:646153.679537} give an overview
of different dispatch techniques in the context of modern processor
architectures.  They are specifically interested in the influence of
\emph{processor pipelining} and \emph{superscalar execution} on the
performance of various dispatch techniques.  In addition, they take
into account the influence of dynamic typing on the dispatch cost.  In
particular, they mention the fact that, with multiple inheritance, the
location of a slot may vary according to the exact runtime type of an
object.  Their study is limited to the case of single dispatch.
Nevertheless, they treat both static techniques and dynamic
techniques, including call-site caching.  Most of the techniques used
are based on some kind of table lookup, including the dynamic
techniques containing a table (which can be small) used as a cache.
They conclude that table-based methods are expensive on modern
hardware, and that inline caching is likely to perform the best on
modern processors.

The work of Zibin and Gil \cite{Zibin:2002:FAC:582419.582434} also
discusses table-based techniques.  Their paper addresses multiple
inheritance and multiple dispatch, but concentrates on the efficiency
of the algorithm for building the dispatch table and the size of the
resulting dispatch table. 

In their 2012 paper, Hariskrishnan and Kumar
\cite{Harikrishnan:2012:SEN:2108144.2108153} also sacrifice efficiency
of the dispatch algorithm in favor of space efficiency by removing the
lookup table altogether.  They propose the alternative name
\emph{constant-time} technique for \emph{table-based} technique, and
\emph{non-constant-time} for techniques that do not use table lookup.
Their technique is applicable only to \emph{single inheritance} systems. 

The technical report of Bachrach and Burke \cite{Bachrach:2000} is
concerned with the language Dylan, which is similar to \cl{} in many
respects, although in terms of generic dispatch, it allows for
specializers other than classes and singletons; and features such as
\emph{sealing} of classes and methods allow for further optimizations
of generic dispatch.  Their approach is similar to ours, in that they
construct a \emph{decision tree} for each generic function.  It is
different from ours in that the decision tree is not implemented as
inline code, but instead as a data structure consisting of
\emph{engine nodes}, requiring the dispatch code to make several
memory accesses.  On the other hand, their technique is more flexible
in that it allows for each call site to exploit the decision tree and
optimize according to locally available information, sometimes
resulting in particular call sites not requiring any dispatch code at
all.  Unfortunately, this technical report is in an unfinished state,
making it hard to evaluate the work and the results behind it.

Dujardin et al \cite{Dujardin:1998:FAC:271510.271521} give a fast
algorithm for creating compressed tables for multiple dispatch.  While
the dispatch is still constant time after compression, as with other
table-compression techniques, theirs adds overhead to the dispatch
itself.  The unique number of the type of each argument must be used
to index a per-argument table in order to obtain indices in the
compressed table.  And of course the element in the compressed table
must then be accessed (requiring index arithmetic and a memory access)
before the relevant method can be invoked.  Merging the per-argument
tables using standard table-compression techniques adds yet more
comparisons and memory accesses. 

Like Bachrach and Burke, Hölze and Ungar
\cite{Holzle:1994:ODC:178243.178478} optimize dispatch by specializing
the dispatch algorithm for each call site.  They use run-time type
information as feedback to the compiler which can then often create
more efficient dispatch code simply because for a given call site it
is common that only a small subset of all possible argument types are
actually used.  We have not addressed this possibility, because it
would require a way to recompile the caller, or at least a single call
site, whenever a change to the callee or to the class hierarchy is
made.  
